Avoiding Boosting Overfitting by Removing Confusing Samples
Abstract
Boosting methods are known to overfit noticeably on some datasets while appearing immune to overfitting on others. In this paper we show that standard boosting algorithms are not appropriate when classes overlap. This inadequacy is likely to be the major source of boosting overfitting on real-world data. To verify our conclusion we use the fact that any task with overlapping classes can be reduced to a deterministic task with the same Bayesian separating surface. This can be done by removing “confusing samples”, that is, samples that are misclassified by a “perfect” Bayesian classifier. We propose an algorithm for removing confusing samples and experimentally study the behavior of AdaBoost trained on the resulting data sets. Experiments confirm that removing confusing samples helps boosting to reduce the generalization error and to avoid overfitting on both synthetic and real-world data. The process of removing confusing samples also provides an accurate error prediction based on the training sets.
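The definition of a confusing sample can be illustrated on synthetic data where the Bayesian classifier is known exactly. The sketch below is not the algorithm proposed in the paper, which the abstract does not specify. It assumes two one-dimensional Gaussian classes with equal priors and equal variance, so that the Bayes rule is a threshold at the midpoint of the two means. All function names and parameter values are hypothetical and chosen for illustration only.

```python
import random


def bayes_label(x, mean_neg=-1.0, mean_pos=1.0):
    """Bayes-optimal label for two equal-prior, equal-variance Gaussians.

    Under these assumptions the decision boundary is the midpoint
    between the two class means.
    """
    threshold = (mean_neg + mean_pos) / 2.0
    return 1 if x > threshold else -1


def make_overlapping_data(n, seed=0, mean_neg=-1.0, mean_pos=1.0, sigma=1.0):
    """Draw n labelled points (x, y) from two overlapping Gaussian classes."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        mean = mean_pos if y == 1 else mean_neg
        data.append((rng.gauss(mean, sigma), y))
    return data


def remove_confusing(data):
    """Split data into samples the Bayes rule classifies correctly (kept)
    and samples it misclassifies (the "confusing" ones, removed)."""
    kept = [(x, y) for x, y in data if bayes_label(x) == y]
    removed = [(x, y) for x, y in data if bayes_label(x) != y]
    return kept, removed


data = make_overlapping_data(1000)
kept, removed = remove_confusing(data)
# The removed fraction is an empirical estimate of the Bayes error
# on this synthetic task.
print(len(kept), len(removed), len(removed) / len(data))
```

After filtering, the kept samples form a deterministic task: every remaining point lies on the side of the boundary that matches its label, and the boundary itself is unchanged. On real data the true class distributions are unknown, so any practical procedure has to approximate the Bayesian classifier in some way.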
Similar resources
Boosting-like Deep Learning For Pedestrian Detection
This paper proposes boosting-like deep learning (BDL) framework for pedestrian detection. Due to overtraining on the limited training samples, overfitting is a major problem of deep learning. We incorporate a boosting-like technique into deep learning to weigh the training samples, and thus prevent overtraining in the iterative process. We theoretically give the details of derivation of our alg...
Feature Selection for Descriptor Based Classification Models. 1. Theory and GA-SEC Algorithm
The paper describes different aspects of classification models based on molecular data sets with the focus on feature selection methods. Especially model quality and avoiding a high variance on unseen data (overfitting) will be discussed with respect to the feature selection problem. We present several standard approaches and modifications of our Genetic Algorithm based on the Shannon Entropy C...
Outlier Detection by Boosting Regression Trees
A procedure for detecting outliers in regression problems is proposed. It is based on information provided by boosting regression trees. The key idea is to select the most frequently resampled observation along the boosting iterations and reiterate after removing it. The selection criterion is based on Tchebychev’s inequality applied to the maximum over the boosting iterations of ...
Boosting by weighting boundary and erroneous samples
This paper shows that new and flexible criteria to resample populations in boosting algorithms can lead to performance improvements. Real Adaboost emphasis function can be divided into two different terms, the first only pays attention to the quadratic error of each pattern and the second takes only into account the “proximity” of each pattern to the boundary. Here, we incorporate an additional...
PDC-SGB: Prediction of effective drug combinations using a stochastic gradient boosting algorithm.
Combinatorial therapy is a promising strategy for combating complex diseases by improving the efficacy and reducing the side effects. To facilitate the identification of drug combinations in pharmacology, we proposed a new computational model, termed PDC-SGB, to predict effective drug combinations by integrating biological, chemical and pharmacological information based on a stochastic gradient...
Publication date: 2007